Authors
Affiliation

Alessandro Pizzi

University of Lausanne

Andrea Lovato

Ayman El Abed

Illia Dorofieiev

Published

December 10, 2024

Abstract

Obesity has become a global health crisis, contributing to rising rates of non-communicable diseases and placing significant strain on healthcare systems worldwide. In this study, we explore the factors influencing obesity in Mexico, Peru, and Colombia through an analysis of a mixed dataset consisting of 77% synthetically generated data and 23% real-world data from 498 participants. Using data cleaning, visualization, and advanced modeling techniques, we identify key lifestyle and behavioral contributors to obesity, including dietary habits, physical activity, and demographic factors. The study employs linear regression to predict BMI and logistic regression to classify obesity, providing valuable insights into the relationship between these factors and obesity levels. While the findings are limited by the use of synthetic data and a non-representative sample, they underscore the importance of data-driven approaches in addressing public health challenges. This report aims to demonstrate the application of theoretical knowledge in a simulated environment and lay the groundwork for future studies targeting obesity reduction strategies.

1 Introduction

1.1 Project Goals

Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.

This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.

While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.

1.2 Research Questions

  • Question 1

    What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?

  • Question 2

    Can we predict whether a person will be obese based on some given combinations of factors?

  • Question 3

    How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?

2 Data

2.1 Sources

The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.

2.2 Description

The dataset consists of 2111 records and 17 attributes, offering a detailed examination of the factors contributing to obesity. The attributes represent a mix of categorical and continuous variables, providing insights into demographic, lifestyle, and behavioral factors. In greater detail, an interactive table was designed to provide a comprehensive summary of the dataset’s variables.

Code
library(here)
library(knitr)
# Main features of the dataset
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
head(dataset_raw) %>%
  kbl(format = "html", caption = "First 6 Rows of the Initial Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
First 6 Rows of the Initial Dataset
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 no Public_Transportation Normal_Weight
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight
Code
desc_table <- tibble(
  Name = colnames(dataset_raw),
  Type = sapply(dataset_raw, class)
)
desc_table %>%
  kbl(format = "html", caption = "Variable Descriptions") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(width = "100%", height = "500px")
Variable Descriptions
Name Type
Gender character
Age numeric
Height numeric
Weight numeric
family_history_with_overweight character
FAVC character
FCVC numeric
NCP numeric
CAEC character
SMOKE character
CH2O numeric
SCC character
FAF numeric
TUE numeric
CALC character
MTRANS character
NObeyesdad character

The dataset underwent a thorough preprocessing phase, including normalization of continuous variables, encoding of categorical data, and removal of missing or atypical entries to ensure high-quality analysis. Class imbalance was addressed using the SMOTE (Synthetic Minority Oversampling Technique), generating synthetic data while carefully avoiding noise or artificial patterns. The final dataset comprises 77% synthetic data, which enhances balance and diversity, and 23% real-world data, adding authenticity. This combination allows for a comprehensive analysis of obesity-related factors, while recognizing potential biases, such as inaccuracies in self-reported information.

2.3 Wrangling

Essential libraries for data manipulation, visualization, and clustering are loaded to begin the wrangling process and support subsequent analysis. Each package is utilized for its specific functionality, facilitating efficient and streamlined analysis:

  • dplyr: for data manipulation (e.g., filtering, summarizing);

  • tidyr: for data tidying (e.g., reshaping);

  • ggplot2: for visualization;

  • corrplot: for correlation matrix visualization;

  • ggridges: for creating ridge plots;

  • cluster: for clustering algorithms;

  • reshape2: for data reshaping, especially during visualization.

Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(ggridges)
library(cluster)
library(reshape2)

Column names are renamed to enhance clarity and improve usability during the analysis. The updated names are designed to be shorter and more intuitive, ensuring ease of reference while retaining their original meaning and context. This adjustment simplifies code readability and helps streamline data manipulation tasks, particularly in complex analytical workflows.

Code
  dataset <- dataset_raw %>%
  rename(
    family_hist = family_history_with_overweight,
    obesity_lev = NObeyesdad,
    caloric_food = FAVC,
    vegetable_food = FCVC,
    nb_meal_day = NCP,
    food_btw_meals = CAEC,
    ch2o = CH2O,
    smoke = SMOKE,
    calorie_check = SCC,
    physical_act = FAF,
    freq_alcohol = CALC,
    use_tech = TUE,
    m_trans = MTRANS,
    gender = Gender,
    age = Age,
    weight = Weight,
    height = Height
  )

The structure of the dataset is examined to identify the data types of each variable, providing critical insights for subsequent data preparation. Understanding the data types helps pinpoint columns requiring transformations, such as converting categorical variables to factors or standardizing numeric variables for analysis.

Code
str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
str_table %>%
  kbl(format = "html", caption = "Original structure of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
Original structure of the Dataset
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : chr "Female" "Female" "Male" "Male" ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : chr "yes" "yes" "yes" "no" ...
$ caloric_food : chr "no" "no" "no" "no" ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
$ smoke : chr "no" "yes" "no" "no" ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : chr "no" "yes" "no" "no" ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : chr "no" "Sometimes" "Frequently" "Frequently" ...
$ m_trans : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
$ obesity_lev : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
Code
dataset <- dataset %>%
  mutate(
    gender = as.factor(gender),
    family_hist = as.factor(family_hist),
    caloric_food = as.factor(caloric_food),
    smoke = as.factor(smoke),
    calorie_check = as.factor(calorie_check),
    m_trans = as.factor(m_trans),
    obesity_lev = factor(obesity_lev, 
                         levels = c("Insufficient_Weight", "Normal_Weight", 
                                    "Overweight_Level_I", "Overweight_Level_II", 
                                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"), 
                         ordered = TRUE),
    food_btw_meals = factor(ifelse(food_btw_meals == "no", "No", food_btw_meals), 
                            levels = c("No", "Sometimes", "Frequently", "Always"), 
                            ordered = TRUE),
    freq_alcohol = factor(ifelse(freq_alcohol == "no", "No", freq_alcohol), 
                          levels = c("No", "Sometimes", "Frequently", "Always"), 
                          ordered = TRUE))


str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
str_table %>%
  kbl(format = "html", caption = "Dataset Structure") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Dataset Structure
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
$ caloric_food : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: Ord.factor w/ 4 levels "No"<"Sometimes"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : Ord.factor w/ 4 levels "No"<"Sometimes"<..: 1 2 3 3 2 2 2 2 3 1 ...
$ m_trans : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
$ obesity_lev : Ord.factor w/ 7 levels "Insufficient_Weight"<..: 2 2 2 3 4 2 2 2 2 2 ...

The transformations ensured the dataset was ready for analysis by restructuring categorical and ordinal variables to meet modeling requirements. Converting categorical variables into factors standardized their representation, reducing ambiguity and improving compatibility with statistical models. For ordinal variables, levels were explicitly ordered to preserve their logical progression and enhance interpretability, allowing for meaningful comparisons across categories.

The updated structure was reviewed to confirm the accuracy of these adjustments, providing confidence in the preprocessing steps. While further transformations like normalization were not applied, the focus on categorical and ordinal adjustments established a strong foundation for reliable and interpretable analysis. In particular, the levels of obesity categories, food consumption between meals, and frequency of alcohol use were arranged to reflect increasing severity or frequency, ensuring these variables captured their intended relationships and supported clear, accurate insights into the data.

Now, a numerical version of the dataset, called “dataset_num”, is created by transforming categorical variables into numerical values, ensuring compatibility with statistical analyses while maintaining logical relationships and interpretability. This numerical transformation is specifically essential for developing the correlation matrix, as it requires all variables to be in numeric format to analyze their relationships effectively.

The presence of potential missing values in the transformed dataset is checked and visualized to confirm data integrity and ensure no issues have been introduced during the conversion process.

Code
dataset_num <- dataset %>%
  mutate(obesity_lev = recode(obesity_lev,
                              "Insufficient_Weight"=1,
                              "Normal_Weight" = 2,
                              "Overweight_Level_I" = 3,
                              "Overweight_Level_II" = 4,
                              "Obesity_Type_I" = 5,
                              "Obesity_Type_II" = 6,
                              "Obesity_Type_III" = 7,
  ))

dataset_num <- dataset %>%
  mutate(freq_alcohol = recode(freq_alcohol,
                               "No"=1,        
                               "Sometimes"=2, 
                               "Frequently" =3,
                               "Always"  =4 
  ))

dataset_num <- dataset %>%
  mutate(m_trans = recode(m_trans,
                          "Automobile"=1,
                          "Bike"=2,
                          "Motorbike"=3,
                          "Public_Transportation"=4,
                          "Walking"=5,
  ))

dataset_num <- dataset %>%
  mutate(food_btw_meals = recode(food_btw_meals,
                                 "No"=0,
                                 "Sometimes"=1 ,
                                 "Frequently"=2,
                                 "Always"=3
  )
  )

dataset_num <- dataset_num%>%
  mutate(calorie_check = recode(calorie_check,
                                "no"=0,
                                "yes"=1 ,
  ))

dataset_num <- dataset_num %>%
  mutate(across(where(is.factor), ~ as.numeric(.)))


str_output <- capture.output(str(dataset_num))
table_num_str <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

table_num_str %>%
  kbl(format = "html", caption = "Structure of the Numerical Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Structure of the Numerical Dataset
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : num 1 1 2 2 2 2 1 2 2 2 ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : num 2 2 2 1 1 1 2 1 2 2 ...
$ caloric_food : num 1 1 1 1 1 2 2 1 2 2 ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: num 1 1 1 1 1 1 1 1 1 1 ...
$ smoke : num 1 2 1 1 1 1 1 1 1 1 ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : num 0 1 0 0 0 0 0 0 0 0 ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : num 1 2 3 3 2 2 2 2 3 1 ...
$ m_trans : num 4 4 4 5 4 1 3 4 4 4 ...
$ obesity_lev : num 2 2 2 3 4 2 2 2 2 2 ...
Code
nb_na<- colSums(is.na(dataset_num))
nb_na %>%
  kbl(format = "html", caption = "Presence of Potential NA Values in the Dataset") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), 
    full_width = FALSE, 
    position = "left"
  ) %>%
  column_spec(1, width = "100px") %>%
  column_spec(2, width = "80px") %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Presence of Potential NA Values in the Dataset
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

The test results confirmed the absence of any NA values in the dataset, indicating that all variables were successfully converted to numeric format without compromising data integrity.

2.4 Spotting Mistakes and Missing Data

Check for missing values

To ensure data integrity, missing values in the dataset are examined by counting “NA” values in each column, providing a clear view of dataset completeness. The results are presented in a formatted table for easy interpretation, with styling applied for readability and a scrollable box to handle larger datasets. This process facilitates prompt handling of missing data through appropriate strategies.

Code
missing_values <- colSums(is.na(dataset))
missing_values %>%
  kbl(format = "html", caption = "Missing Values in Each Column") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width=FALSE, position = "center") %>%
  column_spec(1, width = "100px") %>%
  column_spec(2, width = "80px") %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
Missing Values in Each Column
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

The analysis confirms that all columns contain complete data, with no missing values identified. This completeness ensures a robust foundation for subsequent analysis, eliminating the need for immediate data cleaning related to missing entries.

Check for duplicates

The dataset is examined for duplicated rows to ensure data integrity and eliminate redundancy. Identifying and addressing duplicates is a crucial step in data preprocessing, as redundant entries can skew analysis results and lead to misleading conclusions. This process involves systematically scanning the dataset for identical rows and quantifying their occurrence.

Code
duplicated_rows <- sum(duplicated(dataset))
duplicated_rows
[1] 24

The detection of 24 duplicated rows in the dataset highlights the need for further preprocessing to ensure data integrity, as these redundant entries could skew analysis if not properly handled.

Code
dataset <- dataset %>%
distinct()

nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE

The dataset was refined by removing duplicate entries to ensure that only unique rows are retained. A verification step confirmed that no duplicates remain, ensuring the dataset’s integrity and reliability for further analysis.

2.5 Listing Anomalies and Outliers

A bar chart was created to visualize the distribution of obesity levels, providing a clear overview of class frequencies within the dataset. Particular attention is given to obesity levels, as this variable serves as the dependent variable in the predictive model to be developed later.

Code
g1 <- ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) #Adjusted the text for clarity

plotly_plot <- ggplotly(g1)
plotly_plot

The chart highlights a balanced distribution across all obesity levels, demonstrating the effectiveness of SMOTE in addressing class imbalance. By equalizing the representation of each category, the dataset becomes more reliable for analysis, reducing biases and ensuring a fair evaluation of patterns within the data. On the other hand, the synthetic data introduced by SMOTE may not fully reflect real-world variability, potentially leading to artificial patterns that could affect the interpretability of results.

A density plot was generated to visualize the age distribution across different obesity levels, providing insights into patterns and trends within the data.

Code
g2 <- ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
    labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level")
plotly_plot1 <- ggplotly(g2)
plotly_plot1

This graph provides a detailed view of the age distribution across obesity levels and offers insight into the impact of the SMOTE algorithm in balancing the dataset. The distributions show distinct separation among obesity categories, with younger ages predominantly associated with lower obesity levels (e.g., Insufficient Weight and Normal Weight), while older ages are more prevalent in higher obesity categories (e.g., Obesity Type II and III).

Notably, sharp peaks in the density curves, such as the one around age 30 in “Obesity Type I,” could indicate potential artifacts introduced during the synthetic data generation process. While these patterns align with logical demographic trends, they highlight the need for further validation to ensure that such separations and peaks represent realistic population characteristics rather than biases from data augmentation. Overall, the dataset reflects clear and interpretable patterns, but these observations suggest the importance of cautious interpretation and robust validation in subsequent analyses.

Summary statistics were computed for key variables across obesity levels to identify potential anomalies or patterns, providing a clearer understanding of how age, height, and weight vary within each category.

Code
dataset_stat <- dataset %>%
  group_by(obesity_lev) %>%
  summarize(
    Age_Mean = mean(age, na.rm = TRUE),
    Age_SD = sd(age, na.rm = TRUE),
    Height_Mean = mean(height, na.rm = TRUE),
    Height_SD = sd(height, na.rm = TRUE),
    Weight_Mean = mean(weight, na.rm = TRUE),
    Weight_SD = sd(weight, na.rm = TRUE)
  )
dataset_stat %>%
  kbl(format = "html", caption = "Summary Statistics by Obesity Level", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Summary Statistics by Obesity Level
obesity_lev Age_Mean Age_SD Height_Mean Height_SD Weight_Mean Weight_SD
Insufficient_Weight 19.8 2.7 1.7 0.1 50.0 6.0
Normal_Weight 21.8 5.1 1.7 0.1 62.2 9.3
Overweight_Level_I 23.5 6.3 1.7 0.1 74.5 8.6
Overweight_Level_II 27.0 8.1 1.7 0.1 82.1 8.5
Obesity_Type_I 25.9 7.8 1.7 0.1 92.9 11.5
Obesity_Type_II 28.2 4.9 1.8 0.1 115.3 8.0
Obesity_Type_III 23.5 2.8 1.7 0.1 120.9 15.5

The summary statistics reveal distinct differences across obesity levels. As expected, weight increases progressively with higher obesity categories, accompanied by slightly larger variations in standard deviation. Interestingly, height remains relatively constant across categories, suggesting it plays a limited role in distinguishing obesity levels. The age distribution shows a notable shift, with younger individuals dominating the lower obesity levels and a broader age range in higher levels, highlighting potential demographic patterns worth further exploration. These insights confirm the logical trends in the dataset, providing confidence in its structure while emphasizing the need for further analysis of these relationships.

Clustering was performed using k-means to explore the dataset’s structure and assess the coherence of the groups, with the silhouette score calculated to evaluate the quality and separation of the clusters.

Code
library(cluster)
set.seed(123)
kmeans_res <- kmeans(select(dataset, where(is.numeric)), centers = length(unique(dataset$obesity_lev)))
silhouette_score <- silhouette(kmeans_res$cluster, dist(select(dataset, where(is.numeric))))
mean_silhouette_score <- mean(silhouette_score[, "sil_width"])
mean_silhouette_score
[1] 0.4513519

The mean silhouette score of approximately 0.456 indicates moderate cohesion within clusters and reasonable separation between them. This suggests that the clusters, representing different obesity levels, are distinguishable but not excessively isolated. The result reflects a balance between natural class separability and the effects of data augmentation with SMOTE, which appears to have effectively balanced the dataset without introducing significant distortions. These findings provide confidence in the dataset’s suitability for clustering-based exploration while highlighting the importance of further validation to ensure the robustness of the observed patterns.

2.6 Correlation Analysis

To explore relationships among variables and their association with obesity levels, a correlation matrix was computed. The analysis focuses on identifying the strength and direction of correlations between “obesity_lev” (the dependent variable) and other predictors, such as physical activity, frequency of alcohol consumption, and dietary habits. By ordering variables based on their correlation with “obesity_lev”, the matrix highlights the most influential factors in determining obesity levels. A heatmap visualization was then created to provide an intuitive representation of these relationships, with a gradient color scale indicating the strength of positive and negative correlations. This approach facilitates the identification of key variables for further analysis and modeling.

Code
#Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>% select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight","height", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech", "ch2o", "m_trans", "smoke","nb_meal_day", "calorie_check", "gender"),use = "complete.obs")

#Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

#Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

#Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

#Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

g3 <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) + 
    geom_tile() + 
    geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
    labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        axis.text.y = element_text(angle = 45, vjust = 1))

plot3 <- ggplotly(g3)
plot3
Code
# Create the heatmap with correlation values

# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>% select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech","ch2o", "height", "calorie_check", "gender"), use = "complete.obs")

# Extract the correlations with "obesity_lev"
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

g4 <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        axis.text.y = element_text(angle = 45, vjust = 1) 
  )
plot4 <- ggplotly(g4)
plot4

The correlation matrices provide valuable insights into the relationships between variables and their association with obesity levels. As expected, weight exhibits a very strong positive correlation with obesity level, reinforcing its central role in defining the target variable. Family history of obesity and caloric food consumption also show moderate positive correlations, highlighting their relevance as predictive factors.

Conversely, variables such as physical activity and food consumption between meals exhibit weak or negative correlations, suggesting that their influence on obesity levels is less pronounced. These patterns align with logical trends but also underscore the need for careful consideration of multicollinearity and the relative importance of variables in predictive modeling. The heatmap’s clear organization of variables by their correlation strength aids in identifying the most impactful factors for further analysis. Overall, the results confirm that the dataset’s structure supports a robust examination of the factors influencing obesity.

3 Exploratory Data Analysis (EDA)

The Exploratory Data Analysis (EDA) phase of the project was designed to uncover meaningful patterns and insights while ensuring the dataset was optimized for analysis. A correlation heatmap was employed early in the process to identify and assess relationships between variables. By comparing the initial and refined versions of the heatmap, we effectively filtered out less relevant variables, allowing the analysis to focus on the most impactful features. This step not only streamlined the dataset but also enhanced its interpretability, ensuring a more targeted exploration of key patterns.

Key variables retained after this step were selected based on their strong correlations with the target variable and their potential relevance to underlying patterns in the data.

The EDA process involved a systematic exploration of the cleaned and refined dataset, utilizing visualization tools to highlight trends, distributions, and potential anomalies.

3.0.0.1 Descriptive statistics and distribution analysis

The initial phase of the Exploratory Data Analysis (EDA) concentrated on examining the most strongly correlated variables: Age, Height, and Weight. These variables were prioritized due to their direct relevance and significant relationships with the target outcomes, as highlighted in the refined correlation heatmap.

3.0.0.1.1 Age
Code
age_summary <- summary(dataset$age)
age_sd <- sd(dataset$age, na.rm = TRUE)
sum_age_df <- tibble::tibble(
  Metric = c(names(age_summary), "Standard Deviation"),
  Value = round(c(age_summary, age_sd), 2)
)
kable(sum_age_df, format = "markdown", caption = "Age Variable Statistics")
Age Variable Statistics
Metric Value
Min. 14.00
1st Qu. 19.92
Median 22.85
Mean 24.35
3rd Qu. 26.00
Max. 61.00
Standard Deviation 6.37

The age variable exhibits a right-skewed distribution, with a mean of 24.3 years and a median of 22.78 years, indicating a slight asymmetry toward younger ages. The range spans from 14 to 61 years, though the majority of individuals fall within the 20–30 age group. A standard deviation of 6.35 years reflects moderate variability in age across the dataset. This predominantly young sample may introduce limitations when generalizing findings to older populations, where obesity-related factors might differ significantly.

Age Distribution by Obesity Level

Code
g5 <- ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot5 <- ggplotly(g5)
plot5

The violin plot highlights the age distribution across obesity levels, illustrating that individuals with insufficient or normal weight are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels, such as Obesity Type I and Type II, display a broader age range, peaking around 30–40 years. Severe obesity (Type III) is rare among younger individuals but becomes more prevalent in mid-adulthood. This visualization emphasizes the gradual increase in obesity risk with age, underlining the importance of early intervention, particularly during early and mid-adulthood, when such risks are most pronounced.

Age Distribution with SMOOTH Trend Line for Obesity Probability

Code
g6 <- ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()
plot6 <- ggplotly(g6)
plot6

Complementing this, the trend line graph further captures the trajectory of obesity levels with age. A sharp rise in obesity is observed from adolescence to early adulthood, peaking in the 25–30 years range. This critical transition phase is likely influenced by lifestyle factors such as reduced physical activity, increased caloric intake, and metabolic changes. After this peak, the trend reveals a gradual decline in obesity levels beyond 30 years, potentially reflecting improved health awareness, dietary adjustments, or a selection bias in older populations. These insights underscore the mid-20s to early-30s as a pivotal stage for targeted obesity prevention and intervention strategies.

3.0.0.1.2 Height
Code
height_summary <- summary(dataset$height)
height_sd <- sd(dataset$height, na.rm = TRUE)
sum_height_df <- tibble::tibble(
  Metric = c(names(height_summary), "Standard Deviation"),
  Value = round(c(height_summary, height_sd), 2)
)
kable(sum_height_df, format = "markdown", caption = "Height Variable Statistics")
Height Variable Statistics
Metric Value
Min. 1.45
1st Qu. 1.63
Median 1.70
Mean 1.70
3rd Qu. 1.77
Max. 1.98
Standard Deviation 0.09

Height distribution

Code
g7 <- ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()
plot7 <- ggplotly(g7)
plot7

The height distribution, as shown in the histogram, follows an approximately normal shape with a slight right skew. Most values range between 1.45m and 1.98m, peaking around 1.8m, which represents the most common height. Both the mean and median are 1.7m, confirming a nearly symmetrical distribution. The standard deviation of 0.09 indicates low variability, and no extreme outliers are observed, highlighting a realistic and consistent dataset for height.

Box Plot of Height by Obesity Level

Code
g8 <- ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
plot8 <- ggplotly(g8)
plot8

The violin plot further explores the height distribution across obesity levels. Each category exhibits relatively low variability, with overlapping ranges across groups. Insufficient and Normal Weight categories have slightly narrower distributions, centered around 1.7m. As obesity levels increase, from Obesity Type I to Type III, the distributions remain consistent, indicating that height does not significantly vary with obesity classification. These findings suggest that while height remains a stable feature, weight likely plays a more decisive role in determining obesity levels.

3.0.0.1.3 Weight
Code
sum_weight_df <- tibble::tibble(
  Metric = c(names(summary(dataset$weight)), "Std. Dev"),
  Value = round(c(summary(dataset$weight), sd(dataset$weight, na.rm = TRUE)), 2)
)
kable(sum_weight_df, format = "markdown", caption = "Weight Variable Statistics")
Weight Variable Statistics
Metric Value
Min. 39.00
1st Qu. 66.00
Median 83.10
Mean 86.86
3rd Qu. 108.02
Max. 173.00
Std. Dev 26.19

Density plot for weight distribution by gender

Code
g9 <- ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
plot9 <- ggplotly(g9)
plot9

The density plot highlights distinct differences in weight distribution between genders. Females generally exhibit lower weights, with a peak around 70 units, whereas males show peaks at 85 and 115 units, reflecting a tendency toward higher weights. An overlapping region between 80 and 90 units indicates common weight ranges for both genders, though the distinct peaks underscore gender-based differences. Weight in the dataset ranges from 39 to 173 units, with an average of 86.6 units, a median of 83 units, and a standard deviation of 26.2, indicating moderate variability.

Ridgeline Plot of Weight by Obesity Level.

Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")

Code
# can't seem to make the interractive plot work 

The ridgeline plot further illustrates the relationship between weight and obesity levels. As obesity levels increase, the weight distribution shifts consistently toward higher values. Categories such as “Insufficient Weight” and “Normal Weight” cluster at lower ranges, while higher obesity types (I, II, and III) peak at significantly greater weights. This clear progression confirms a strong positive association between weight and obesity levels, reinforcing the centrality of weight in obesity classification. The dataset’s average weight remains at 86.6 units with a standard deviation of 26.6, capturing the variability across different obesity categories.

3.0.0.1.4 Height and Weight

Researching deeply the relationship between height and weight, trends across obesity levels were examined using scatter plots, providing critical insights that reinforce theoretical expectations while contextualizing weight variations within height ranges for different obesity classifications.

Scatter Plot (height vs weight), colored by obesity level

Code
g11 <- ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +  # Adds a trend line for each obesity level
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")
plot11 <- ggplotly(g11)
plot11

Facet Grid for Height and Weight by Obesity Level

Code
g12 <- ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")
plot12 <- ggplotly(g12)
plot12

The first scatter plot presents an overview, illustrating the general trend of increasing weight with height, stratified by obesity levels. To better isolate and visualize these individual trends, the initial graph is expanded into a facet grid, offering a clearer perspective on the separate trends within each obesity category and highlighting distinct relationships and ranges.

Correlation between height and weight

Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468

The observed correlation between height and weight (r = 0.463) aligns with findings in existing literature, confirming a moderate positive relationship and reinforcing the expectation that taller individuals generally weigh more, though the strength of this association varies slightly across obesity levels.

With the analysis of age, height, and weight completed, attention shifts to exploring the remaining variables in the dataset. These variables, while less directly correlated with the target outcomes, offer critical insights into behavioral, lifestyle, and environmental factors that may influence obesity levels.

3.0.0.1.5 Food between meals
Code
# Dodged Bar Chart for food_btw_meals by obesity levels
g13 <- ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
   labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14))

plot13 <- ggplotly(g13)
plot13
Code
# Stacked Bar Chart of Food Between Meals by Obesity Level (Proportions within each Obesity Level)
g14 <- ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
    geom_bar(position = "fill") + # Stacked bar chart with proportions
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
    ggtitle("Proportion of Food Between Meals Across Obesity Levels") + # Shortened and clear title
    labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") + # Correct axis and legend labels
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis text for readability
        plot.title = element_text(hjust = 0.5, size = 14) # Center and style the title
    )
plot14 <- ggplotly(g14)
plot14

The charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.

3.0.0.1.6 High-caloric food consumption
Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Counts)
g16 <- ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  geom_bar(
    position = "dodge",
    color = "black"
  ) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Count", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )
plot16 <- ggplotly(g16)
plot16

The grouped bar chart highlights a clear trend of increased high-caloric food consumption as obesity levels rise. High-caloric food consumption (“yes”) dominates across all categories, surpassing 75% of responses and becoming nearly universal in higher obesity levels (Obesity Type I–III). In contrast, “no” responses are more prominent in lower categories like Insufficient Weight and Normal Weight but remain relatively infrequent.

Code
percentage_high_caloric_consumers <- mean(dataset$caloric_food == "yes") * 100
percentage_high_caloric_consumers
[1] 88.35649

More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, a behavior strongly associated with weight gain. This underscores the critical importance of dietary interventions aimed at reducing high-calorie intake to address obesity progression effectively.

3.0.0.1.7 Alcohol consumption

Frequence in consumption of alcohol

Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")
g17 <- ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
   labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

plot17 <- ggplotly(g17)
plot17

Regarding alcohol consumption, the chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.

Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
g18 <- ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) + # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and style title
    legend.position = "top"
  )
plot18 <- ggplotly(g18)
plot18

To better illustrate the trends in alcohol consumption frequency across obesity levels, this graph was created to highlight the shifting proportions of individuals consuming alcohol “Sometimes” and abstaining (“No”). The proportion of individuals who drink alcohol “Sometimes” shows a steady increase with higher obesity levels, peaking in Obesity_Type_III. Conversely, the proportion of those who abstain from alcohol decreases as obesity levels rise, suggesting an inverse relationship between abstention and obesity severity.

This pattern raises questions about the potential interaction between alcohol consumption frequency and caloric food preferences, as both behaviors appear to be associated with higher obesity levels. Investigating this interaction could provide insights into whether a combination of moderate alcohol consumption and high-calorie food preferences exerts a compounded effect on obesity risk. Understanding these combined lifestyle factors could inform strategies aimed at mitigating obesity progression more effectively.

3.0.0.1.8 Daily Calorie Monitoring
Code
# Dodged Bar Chart for calorie_check by Obesity Levels
g19 <- ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("    Dodged Bar Chart for the check of the calories by Obesity Levels") +
   labs(x = "High-Caloric Food Consumption", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title
plot19 <- ggplotly(g19)
plot19
Code
data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(total = sum(count), proportion = count / total)

# Proportion of Calorie Checking by Obesity Level
g20 <- ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
plot20 <- ggplotly(g20)
plot20

The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The proportion graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.

3.0.0.1.9 Vegetable consumption
Code
g21 <- ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y =after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")
plot21 <- ggplotly(g21)
plot21
Code
g22 <- ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", 
         x = "Weight", 
         y = "Vegetable Food Consumption") +
    theme_minimal() +
    coord_cartesian(xlim= c(40, 135), ylim= c(2, 3))
plot22 <- ggplotly(g22)
plot22

The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.

This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.

The trend reveals two possible key insights:

  • A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
  • The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
3.0.0.1.10 Physical activity
Code
g23 <- ggplot(dataset, aes(x = physical_act)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")
plot23 <- ggplotly(g23)
plot23

The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these values are common reported levels. Intermediate values, likely due to synthetic data or SMOTE, are also present but less frequent.

Violin plot by category

Code
g24 <- ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +  
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
    ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot24 <- ggplotly(g24)
plot24

Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.

The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.

3.0.0.1.11 Water consumption
Code
g25 <- ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", size = 1) +
  ggtitle("Histogram and Density of Comsumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")
plot25 <- ggplotly(g25)
plot25

This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.

Trend Line of Weight vs Water Consumption

Code
# Scatterplot with a LOESS trend line
g26 <- ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Water Consumption", x = "Weight", y = "Water Consumption (ch2o)") +
    theme_minimal() +
coord_cartesian(xlim= c(35, 135))
plot26 <- ggplotly(g26)
plot26

The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.

This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.

3.0.0.1.12 Technology utilization

Density of Use of Technology by Obesity Level

Code
g28 <- ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()
plot28 <- ggplotly(g28)
plot28

This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.

The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.

This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.

The Exploratory Data Analysis (EDA) phase provided a comprehensive understanding of the dataset, offering key insights into the relationships between various behavioral, lifestyle, and demographic factors and obesity levels. By focusing on critical variables the EDA revealed patterns and trends that are integral to the modeling process.

4 Analysis

The analysis phase is dedicated to the development, refinement, and comprehensive evaluation of the predictive models, meticulously designed to directly address the previously defined research questions.

4.1 Methods

The modeling process is structured to address the two key research questions:

  1. identifying the most significant lifestyle and behavioral factors contributing to obesity in Mexico, Peru, and Colombia;

  2. predicting whether a person will be obese based on some given combinations of factors.

4.1.1 Linear Regression Model

A linear regression model will be developed to predict an individual’s BMI using weight and height as predictors, reflecting their foundational role in BMI calculation. As emphasized by Mendoza Palechor and De La Hoz Manotas (2019), these variables are fundamental to understanding body composition and are directly tied to the dataset’s variable of obesity levels. By focusing on BMI as a continuous outcome, this approach complements categorical classifications by capturing more detailed variations in body composition across the population. The model will be evaluated using standard regression metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R², ensuring its predictive accuracy and reliability while providing a robust foundation for public health applications.

4.1.2 Logistic Regression Model

To accurately address the key research questions, a logistic regression model will be employed to estimate the probability of individuals belonging to a categorie: obese or not obese. Weight and height will be excluded as predictors in the model because they are directly used to calculate BMI, which serves as the basis for the obesity levels categorized in the dataset. Including these variables would create a dependency between the predictors and the target variable, potentially biasing the analysis. By excluding weight and height, the focus shifts to behavioral and lifestyle factors, such as dietary habits, physical activity, and demographic characteristics, to better understand their influence on obesity risk.

While logistic regression provides a clear and interpretable framework for estimating probabilities, it inherently limits the analysis to a binary classification. This restriction prevents the exploration of the full spectrum of obesity levels, such as Obesity Type I, II, or III, as classified in the dataset. Despite this limitation, logistic regression is a robust method for quantifying the relationships between independent variables and the binary outcome. Feature selection techniques will ensure that only the most relevant predictors are retained, and the model’s performance will be rigorously evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, ensuring reliable and actionable insights.

4.1.3 Insights and Limitations

Regression analysis helps us understand how predictors influence outcomes, with logistic regression classifying individuals as obese or not obese and linear regression predicting BMI as a continuous variable. The dataset offers a mix of advantages and challenges: synthetic data ensures balanced representation but lacks the complexity of real-world patterns, while user-collected data adds variability but is prone to biases. Logistic regression simplifies the analysis by focusing on binary outcomes, leaving out the nuanced gradations of obesity, and assumes linearity, which may not fully capture complex relationships. Linear regression relies on accurate weight and height data, making it sensitive to reporting errors. Despite these limitations, the models offer insights into obesity risk and body composition, serving as a valuable exercise and foundation for future projects, even if not directly applicable to real-world scenarios.

4.2 Goals for Each Method

4.2.1 Linear Regression Model Development

Data Loading and Processing

The dataset was imported, and initial exploration was conducted to understand its structure. BMI was calculated as a key variable, and missing values were addressed by removing incomplete rows. Boxplots were used to visualize the distributions of key variables, ensuring the dataset was ready for analysis.

Code
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
head(dataset_raw)
  Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female  21   1.62   64.0                            yes   no    2   3
2 Female  21   1.52   56.0                            yes   no    3   3
3   Male  23   1.80   77.0                            yes   no    2   3
4   Male  27   1.80   87.0                             no   no    3   3
5   Male  22   1.78   89.8                             no   no    2   1
6   Male  29   1.62   53.0                             no  yes    2   3
       CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
1 Sometimes    no    2  no   0   1         no Public_Transportation
2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
4 Sometimes    no    2  no   2   0 Frequently               Walking
5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
6 Sometimes    no    2  no   0   0  Sometimes            Automobile
           NObeyesdad
1       Normal_Weight
2       Normal_Weight
3       Normal_Weight
4  Overweight_Level_I
5 Overweight_Level_II
6       Normal_Weight
Code
summary(dataset_raw)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight     FAVC                FCVC      
 Length:2111                    Length:2111        Min.   :1.000  
 Class :character               Class :character   1st Qu.:2.000  
 Mode  :character               Mode  :character   Median :2.386  
                                                   Mean   :2.419  
                                                   3rd Qu.:3.000  
                                                   Max.   :3.000  
      NCP            CAEC              SMOKE                CH2O      
 Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
 1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
 Median :3.000   Mode  :character   Mode  :character   Median :2.000  
 Mean   :2.686                                         Mean   :2.008  
 3rd Qu.:3.000                                         3rd Qu.:2.477  
 Max.   :4.000                                         Max.   :3.000  
     SCC                 FAF              TUE             CALC          
 Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
 Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
                    Mean   :1.0103   Mean   :0.6579                     
                    3rd Qu.:1.6667   3rd Qu.:1.0000                     
                    Max.   :3.0000   Max.   :2.0000                     
    MTRANS           NObeyesdad       
 Length:2111        Length:2111       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
Code
dataset_raw$BMI <- dataset_raw$Weight / (dataset_raw$Height^2)
summary(dataset_raw$BMI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   24.33   28.72   29.70   36.02   50.81 
Code
if (sum(is.na(dataset_raw)) > 0) {cat("Missing values detected! Removing rows with NA.\n")
  dataset_raw <- na.omit(dataset_raw)}
boxplot(dataset_raw$Weight, main = "Weight Distribution", col = "grey", border = "black", notch = TRUE, horizontal = TRUE, xlab = "Weight (kg)", ylim = c(30, 200))
grid(nx = NULL, ny = NULL, lty = 0.5, col = "black")

Code
boxplot(dataset_raw$Height, main = "Height Distribution", col = "lightgreen", border = "darkgreen", notch = TRUE, ylab = "Height (m)", ylim = c(1.4, 2))

Model Development

A linear regression model was built to examine the relationship between BMI, weight, and height. The model summary provided key performance metrics and insights into variable contributions.

Code
linear_model <- lm(BMI ~ Weight + Height, data = dataset_raw)
summary(linear_model)

Call:
lm(formula = BMI ~ Weight + Height, data = dataset_raw)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0892 -0.3809  0.1300  0.4007  2.4948 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.626e+01  3.455e-01   162.8   <2e-16 ***
Weight       3.403e-01  7.767e-04   438.1   <2e-16 ***
Height      -3.292e+01  2.180e-01  -151.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8282 on 2108 degrees of freedom
Multiple R-squared:  0.9893,    Adjusted R-squared:  0.9893 
F-statistic: 9.767e+04 on 2 and 2108 DF,  p-value: < 2.2e-16

Evaluation

The model was evaluated by generating predictions and calculating key performance metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). These metrics assess the model’s accuracy and its ability to explain the variance in BMI.

Code
library(Metrics)
predictions <- predict(linear_model, newdata = dataset_raw)
head(predictions)
       1        2        3        4        5        6 
24.70397 25.27386 23.20182 26.60432 28.21541 20.96121 
Code
mse <- mean((dataset_raw$BMI - predictions)^2)
rmse <- rmse(dataset_raw$BMI, predictions)
r_squared <- summary(linear_model)$r.squared
cat("MSE:", mse, "\nRMSE:", rmse, "\nR²:", r_squared, "\n")
MSE: 0.6848888 
RMSE: 0.8275801 
R²: 0.9893238 

Results Visualization

A scatter plot was created to compare actual BMI values with predicted values. A reference line (ideal fit) was added to assess the alignment of predictions with true values. The plot provides a visual representation of the model’s accuracy.

Code
plot(dataset_raw$BMI, predictions, xlab = "Actual BMI", ylab = "Predicted BMI", main = "Comparison between actual and predicted BMI", pch = 16, col = "blue", cex = 0.6) 
abline(0, 1, col = "red", lwd = 2)
legend("topleft", legend = c("Predicted Values", "Regression Line"), col = c("blue", "red"), pch = c(16, NA), lty = c(NA, 1), bty = "n")

Diagnostics

Diagnostic plots were generated to evaluate the linear regression model’s assumptions, including residual patterns, normality, and variance consistency. A histogram of residuals was also created to assess their distribution, with a vertical reference line highlighting the zero-residual point.

Code
par(mfrow = c(2, 2), mar = c(4, 4, 2.5, 2), cex.main = 1.3, cex.lab = 1, cex.axis = 1)
plot(linear_model, col = "Blue",pch = 19, cex = 0.2, lwd = 2)

Code
hist(residuals(linear_model), col = "Gray", border = "white", main = "Residuals Distribution", xlab = "Residuals", ylab = "Frequency", breaks = 15, cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2)
abline(v = 0, col = "red", lwd = 2, lty = 2)

4.2.2 Logistic Regression Model Development

Data Loading and Processing

To align with the requirements of a logistic regression model, it was necessary to modify the dataset’s target variable. The original variable, obesity level, was a multi-class categorical variable representing varying degrees of obesity and non-obesity. Since logistic regression is designed for binary classification, the target variable was converted into a binary format. Individuals with a BMI ≥ 30 were classified as obese (1), while others were classified as non-obese (0). This transformation ensured compatibility with the logistic regression framework. Following this adjustment, the target variable was converted into a factor, and the dataset was reviewed for consistency and readiness for analysis.

Code
dataset_raw$BMI <- dataset_raw$Weight / (dataset_raw$Height^2)
head(dataset_raw$BMI)
[1] 24.38653 24.23823 23.76543 26.85185 28.34238 20.19509
Code
summary(dataset_raw$BMI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   24.33   28.72   29.70   36.02   50.81 
Code
dataset_raw$Obesity <- ifelse(dataset_raw$BMI >= 30, 1, 0)
dataset_raw$Obesity <- as.factor(dataset_raw$Obesity)
table(dataset_raw$Obesity)

   0    1 
1137  974 
Code
head(dataset_raw)
  Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female  21   1.62   64.0                            yes   no    2   3
2 Female  21   1.52   56.0                            yes   no    3   3
3   Male  23   1.80   77.0                            yes   no    2   3
4   Male  27   1.80   87.0                             no   no    3   3
5   Male  22   1.78   89.8                             no   no    2   1
6   Male  29   1.62   53.0                             no  yes    2   3
       CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
1 Sometimes    no    2  no   0   1         no Public_Transportation
2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
4 Sometimes    no    2  no   2   0 Frequently               Walking
5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
6 Sometimes    no    2  no   0   0  Sometimes            Automobile
           NObeyesdad      BMI Obesity
1       Normal_Weight 24.38653       0
2       Normal_Weight 24.23823       0
3       Normal_Weight 23.76543       0
4  Overweight_Level_I 26.85185       0
5 Overweight_Level_II 28.34238       0
6       Normal_Weight 20.19509       0

Twelve predictors associated with obesity-related behaviors, dietary habits, physical activity, and lifestyle factors were selected for analysis. These variables, along with the binary target variable Obesity (1 = obese, 0 = not obese), formed the dataset for logistic regression modeling. The dataset was reviewed to ensure correct structure and readiness for analysis.

Code
predictors <- c("family_history_with_overweight", "FAVC", "FCVC", "NCP", "CAEC", "SMOKE", "CH2O", "SCC", "FAF", "TUE", "CALC", "MTRANS")
model_data <- dataset_raw[, c("Obesity", predictors)]
str(model_data)
'data.frame':   2111 obs. of  13 variables:
 $ Obesity                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
 $ FAVC                          : chr  "no" "no" "no" "no" ...
 $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
 $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
 $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
 $ SMOKE                         : chr  "no" "yes" "no" "no" ...
 $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
 $ SCC                           : chr  "no" "yes" "no" "no" ...
 $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
 $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
 $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
 $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...

Model Development

Three regression models were employed to ensure a systematic and robust approach to predictor selection and model development. The null model, containing only the intercept, served as a baseline to represent predictions without the influence of any predictors. This provided a reference point to evaluate how much additional explanatory power was gained by including predictors.

The full model, incorporating all predictors, represented the maximum complexity allowable within the dataset. This model helped understand the potential contribution of each variable but carried the risk of overfitting due to its complexity.

The stepwise model, guided by the Akaike Information Criterion (AIC), balanced the simplicity and performance of the model. By iteratively evaluating the inclusion or exclusion of predictors, the stepwise procedure identified the subset of variables that significantly contributed to explaining the outcome while minimizing unnecessary complexity. This process ensured that the final model retained only the most relevant predictors, achieving optimal fit and generalizability. Using these three models allowed for a thorough comparison and the development of a parsimonious and effective predictive model.

Code
full_model <- glm(Obesity ~ ., data = model_data, family = binomial)
null_model <- glm(Obesity ~ 1, data = model_data, family = binomial)
stepwise_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "both", trace = FALSE)

Presented below is a comprehensive overview of the logistic regression models.

Full Model

Code
summary(full_model)

Call:
glm(formula = Obesity ~ ., family = binomial, data = model_data)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       -15.36505  324.74530  -0.047 0.962263    
family_history_with_overweightyes   3.69719    0.37458   9.870  < 2e-16 ***
FAVCyes                             2.07091    0.25824   8.019 1.06e-15 ***
FCVC                                0.88012    0.11247   7.826 5.05e-15 ***
NCP                                 0.03708    0.07752   0.478 0.632407    
CAECFrequently                     -2.06362    0.59237  -3.484 0.000495 ***
CAECno                             -0.77655    0.89991  -0.863 0.388185    
CAECSometimes                       1.26856    0.45850   2.767 0.005662 ** 
SMOKEyes                            1.05465    0.46099   2.288 0.022151 *  
CH2O                                0.16385    0.10057   1.629 0.103271    
SCCyes                             -2.65656    0.63212  -4.203 2.64e-05 ***
FAF                                -0.32582    0.07185  -4.535 5.76e-06 ***
TUE                                -0.40060    0.09648  -4.152 3.29e-05 ***
CALCFrequently                      5.93159  324.74480   0.018 0.985427    
CALCno                              6.51658  324.74462   0.020 0.983990    
CALCSometimes                       6.66914  324.74464   0.021 0.983615    
MTRANSBike                          0.28608    1.43312   0.200 0.841778    
MTRANSMotorbike                     1.60556    0.93874   1.710 0.087203 .  
MTRANSPublic_Transportation         0.56905    0.12959   4.391 1.13e-05 ***
MTRANSWalking                      -1.90350    0.65161  -2.921 0.003486 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2913.9  on 2110  degrees of freedom
Residual deviance: 1934.8  on 2091  degrees of freedom
AIC: 1974.8

Number of Fisher Scoring iterations: 11

Null Model

Code
summary(null_model)

Call:
glm(formula = Obesity ~ 1, family = binomial, data = model_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.15474    0.04366  -3.544 0.000394 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2913.9  on 2110  degrees of freedom
Residual deviance: 2913.9  on 2110  degrees of freedom
AIC: 2915.9

Number of Fisher Scoring iterations: 3

Stepwise Model

Code
summary(stepwise_model)

Call:
glm(formula = Obesity ~ family_history_with_overweight + CAEC + 
    FAVC + FCVC + MTRANS + SCC + FAF + TUE + SMOKE + CH2O, family = binomial, 
    data = model_data)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       -8.81742    0.71456 -12.340  < 2e-16 ***
family_history_with_overweightyes  3.69727    0.37458   9.870  < 2e-16 ***
CAECFrequently                    -2.04508    0.58732  -3.482 0.000498 ***
CAECno                            -0.68606    0.89498  -0.767 0.443339    
CAECSometimes                      1.35486    0.45623   2.970 0.002981 ** 
FAVCyes                            2.11404    0.25671   8.235  < 2e-16 ***
FCVC                               0.89359    0.11159   8.008 1.17e-15 ***
MTRANSBike                         0.37745    1.44019   0.262 0.793254    
MTRANSMotorbike                    1.64596    0.94208   1.747 0.080611 .  
MTRANSPublic_Transportation        0.59885    0.12817   4.672 2.98e-06 ***
MTRANSWalking                     -1.85268    0.65036  -2.849 0.004390 ** 
SCCyes                            -2.66952    0.63102  -4.230 2.33e-05 ***
FAF                               -0.34384    0.07050  -4.877 1.08e-06 ***
TUE                               -0.42405    0.09552  -4.440 9.01e-06 ***
SMOKEyes                           1.02529    0.45514   2.253 0.024280 *  
CH2O                               0.16845    0.09937   1.695 0.090040 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2913.9  on 2110  degrees of freedom
Residual deviance: 1940.7  on 2095  degrees of freedom
AIC: 1972.7

Number of Fisher Scoring iterations: 7

Evaluation

To evaluate the stepwise-selected model, predicted probabilities of obesity were generated for all individuals. These probabilities were converted into binary classifications using a threshold of 0.5. A confusion matrix was constructed to assess the model’s performance, providing key metrics such as accuracy, sensitivity, specificity, precision, and F1-score.

Code
# predicted_probs <- predict(stepwise_model, type = "response")
# predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)
# conf_matrix <- confusionMatrix(as.factor(predicted_classes), model_data$Obesity)
# print(conf_matrix)
Code
library(caret)
library(pROC)
# Predict probabilities from the stepwise model
predicted_probs <- predict(stepwise_model, type = "response")

# Convert probabilities to binary classes (using a threshold of 0.5)
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)

# Create a confusion matrix (comparison of predicted vs. actual values)
conf_matrix <- confusionMatrix(as.factor(predicted_classes), model_data$Obesity)
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 772 141
         1 365 833
                                          
               Accuracy : 0.7603          
                 95% CI : (0.7415, 0.7784)
    No Information Rate : 0.5386          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5256          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6790          
            Specificity : 0.8552          
         Pos Pred Value : 0.8456          
         Neg Pred Value : 0.6953          
             Prevalence : 0.5386          
         Detection Rate : 0.3657          
   Detection Prevalence : 0.4325          
      Balanced Accuracy : 0.7671          
                                          
       'Positive' Class : 0               
                                          
Code
# Compute ROC curve using the actual class labels and predicted probabilities
roc_curve <- pROC::roc(model_data$Obesity, predicted_probs)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Code
auc_value <- roc_curve$auc
cat("AUC:", round(auc_value, 3), "\n")
AUC: 0.858 

Additionally, the ROC curve and AUC were calculated to further evaluate the model’s discriminative ability. The ROC curve visualizes the trade-off between sensitivity and specificity, while the AUC quantifies the model’s ability to distinguish between obese and non-obese individuals.

Code
# roc_curve <- roc(model_data$Obesity, predicted_probs)
# auc_value <- auc(roc_curve)
# cat("AUC:", round(auc_value, 3),"\n")
# AUC: 0.858

Results Visualization

To assess the distribution of predicted probabilities, a scatter plot was created with observations color-coded by their actual class. This visualization provides a clear overview of the model’s predictions and potential misclassifications.

Code
plot(predicted_probs, col = ifelse(model_data$Obesity == 1, "blue", "red"), pch = 16, xlab = "n° Observation", ylab = "Predicted Probability", main = "Predicted Probabilities of Obesity", cex = 0.6)
legend("bottomright", legend = c("Obese", "Not Obese"), col = c("blue", "red"), pch = 16)

For additional clarity, the ROC curve was plotted to visually represent the model’s performance.

Code
plot(roc_curve, col = "blue", main = "ROC Curve", lwd = 3, xlim = c(0, 1), ylim = c(0, 1.05), xlab = "False Positive Rate", ylab = "True Positive Rate", cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.1)
legend("topright", legend = paste("AUC =", round(auc_value, 3)), lwd = 0, cex = 1.2, bty = "n")
grid()

Predicting Obesity Probability

To test the model’s ability to predict the probability of individuals becoming obese, six distinct profiles were created, representing a diverse range of lifestyles. Each profile was carefully designed to highlight specific behavioral, dietary, and lifestyle patterns.

The first individual represents a high-risk case for obesity. This person has a family history of being overweight, frequently consumes high-calorie foods and snacks, and eats very few vegetables. They have five meals a day, drink only 0.5 liters of water daily, and do no physical activity. Additionally, they spend 10 hours a day using technology, consume alcohol consistently, and rely primarily on public transportation for mobility.

The second individual exemplifies a very healthy lifestyle. They have no family history of being overweight, rarely consume high-calorie foods or snacks, and eat a large amount of vegetables. Their diet consists of very few meals per day, complemented by a high water intake of 4 liters daily. They do not monitor calorie intake but engage in physical activity five times a week. They walk as their primary mode of transportation, do not consume alcohol, and spend only 0.5 hours daily using technology.

The third individual exhibits a balanced lifestyle but shows some risk factors. This person has a family history of being overweight, frequently consumes snacks and high-calorie foods, and eats a moderate amount of vegetables. They have three meals a day, drink 1 liter of water, and monitor their calorie intake. However, they engage in physical activity only once a week, use technology for 8 hours daily, use motorbike as transportation vehicle, and occasionally consume alcohol.

The fourth individual is physically active and health-conscious. They have no family history of being overweight, do not frequently consume high-calorie foods, but snack occasionally. They eat a lot of vegetables, have a small number of meals per day, and drink 3 liters of water daily. They do not monitor calorie intake but exercise three times a week and use a bicycle for transportation. They consume alcohol frequently but spend only an hour daily using technology.

The fifth individual represents another high-risk case due to a sedentary lifestyle. They have a family history of being overweight, frequently consume high-calorie foods and snacks, and eat very few vegetables. They have four meals a day, drink 2 liters of water, and do no physical activity. They spend 6 hours daily using technology, consume alcohol moderately, and rely on public transportation.

The sixth individual leads a very active lifestyle but has some risk factors due to alcohol and transportation choices. They have no family history of being overweight, do not frequently consume high-calorie foods or snacks, and eat a large amount of vegetables. They have two meals per day, drink 1.5 liters of water, and do not monitor calorie intake. However, they engage in physical activity four times a week, use a motorbike for transportation, do not consume alcohol, and spend 2 hours daily using technology.

These six profiles were designed to test the model’s capacity to handle a wide variety of real-world scenarios, ensuring it can effectively predict obesity probabilities across diverse populations.

Code
new_data <- data.frame(
  family_history_with_overweight = factor(c("yes", "no", "yes", "no", "yes", "no"), 
                                          levels = c("yes", "no")),
  FAVC = factor(c("yes", "no", "yes", "no", "yes", "no"), 
                levels = c("yes", "no")),
  FCVC = c(1, 5, 2, 4, 1, 3),
  NCP = c(5, 1, 3, 2, 4, 2),
  CAEC = factor(c("Frequently", "Sometimes", "Always", "Sometimes", "Frequently", "Always"), 
                levels = c("Frequently", "Sometimes", "Always")),
  SMOKE = factor(c("no", "yes", "no", "yes", "yes", "no"), 
                 levels = c("yes", "no")),
  CH2O = c(0.5, 4, 1, 3, 2, 1.5),
  SCC = factor(c("yes", "no", "yes", "no", "yes", "no"), 
               levels = c("yes", "no")),
  FAF = c(0, 5, 1, 3, 0, 4),
  TUE = c(10, 0.5, 8, 1, 6, 2),
  CALC = factor(c("Always", "Never", "Sometimes", "Always", "Sometimes", "Never"), 
                levels = c("Sometimes", "Frequently", "Always", "Never")),
  MTRANS = factor(c("Public_Transportation", "Walking", "Motorbike", "Bike", "Public_Transportation", "Motorbike"), 
                  levels = c("Public_Transportation", "Walking", "Bike", "Motorbike"))
)

predicted_probs <- predict(stepwise_model, newdata = new_data, type = "response")
predicted_probs
           1            2            3            4            5            6 
3.090608e-05 6.185149e-03 2.987690e-03 3.119962e-02 6.045868e-04 1.560071e-03 

4.3 Results

4.3.1 5. Conclusion

So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.

The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model—one that includes extreme values and one that excludes them—to evaluate the impact of outliers on model accuracy and stability. Key metrics such as R², P-values, and VIF will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.

These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.

4.4 Next Steps

Outline the next steps planned for completing the project, such as refining analyses, adding new methods, or addressing outstanding data issues.

4.5 Final Thoughts

Briefly reflect on any challenges or limitations encountered so far and how these might be addressed in the final report.